Introduction¶
This project groups people into clusters based on their features.
What is clustering and how does it work?¶
Clustering is an unsupervised learning technique that groups similar data points together into clusters. It works by finding patterns and structures in unlabeled data, where points within a cluster are more similar to each other than to other points in other clusters. Some common algorithms are K-means, Agglomerative clustering, and DBSCAN.
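As a quick illustration of the idea (on synthetic data, not this project's dataset), K-means can be asked to split unlabeled 2-D points into a chosen number of groups:

```python
# A minimal sketch: cluster synthetic 2-D points into 3 groups with K-means.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# 150 unlabeled points drawn around 3 centers (synthetic, for illustration)
X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)  # one cluster label (0, 1, or 2) per point
```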
Data Introduction¶
- Customer ID
- Gender
- Age
- Annual Income
- Spending Score - Score assigned by the shop, based on customer behavior and spending nature
- Profession
- Work Experience - in years
- Family Size
Decision Making for Modeling¶

Data Info¶
import polars as pl
import seaborn as sns
import matplotlib.pyplot as plt
import altair as alt
# load data
df = pl.read_csv("data/Customers.csv")
df.head()
| CustomerID | Gender | Age | Annual Income ($) | Spending Score (1-100) | Profession | Work Experience | Family Size |
|---|---|---|---|---|---|---|---|
| i64 | str | i64 | i64 | i64 | str | i64 | i64 |
| 1 | "Male" | 19 | 15000 | 39 | "Healthcare" | 1 | 4 |
| 2 | "Male" | 21 | 35000 | 81 | "Engineer" | 3 | 3 |
| 3 | "Female" | 20 | 86000 | 6 | "Engineer" | 1 | 1 |
| 4 | "Female" | 23 | 59000 | 77 | "Lawyer" | 0 | 2 |
| 5 | "Female" | 31 | 38000 | 40 | "Entertainment" | 2 | 6 |
# check if there's more than 50 samples, there should be 2000
df.shape
(2000, 8)
Since I want to predict which profession someone might be in based on their income, family size, and other factors, we would be predicting a category: Profession.
Let's ignore that path, though, and answer "NO" to the labeled-data question, treating the data as unlabeled, which brings us to the clustering section of the sklearn algorithm cheat sheet.
The next question will be if we know how many categories there are, and we can find that out. We'll also check out the null count, since people might not have a profession, and we'll just fill those nulls with another category called "No Profession".
df.null_count()
| CustomerID | Gender | Age | Annual Income ($) | Spending Score (1-100) | Profession | Work Experience | Family Size |
|---|---|---|---|---|---|---|---|
| u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
| 0 | 0 | 0 | 0 | 0 | 35 | 0 | 0 |
df = df.with_columns(
    pl.col("Profession").fill_null("No Profession")
)
df.null_count()
| CustomerID | Gender | Age | Annual Income ($) | Spending Score (1-100) | Profession | Work Experience | Family Size |
|---|---|---|---|---|---|---|---|
| u32 | u32 | u32 | u32 | u32 | u32 | u32 | u32 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Data Visualization¶
Now let's check how many categories we have.
df["Profession"].value_counts(sort=True)
| Profession | count |
|---|---|
| str | u32 |
| "Artist" | 612 |
| "Healthcare" | 339 |
| "Entertainment" | 234 |
| "Engineer" | 179 |
| "Doctor" | 161 |
| "Executive" | 153 |
| "Lawyer" | 142 |
| "Marketing" | 85 |
| "Homemaker" | 60 |
| "No Profession" | 35 |
We have 10 categories for Profession. To visualize the distribution of professions:
chart = df["Profession"].value_counts(sort=True).plot.bar(x="Profession", y="count")
chart = chart.properties(width=700, height=400, title="Profession Count")
chart
- Most people in this dataset are artists, and the second most common profession is healthcare.
Let's look at the gender distribution.
alt.Chart(df).mark_bar().encode(
    x="Gender",
    y="count(Gender)"
).properties(width=700, height=400)
- The majority are female, so the dataset is biased toward female customers; we may need to check whether the measures of central tendency differ significantly between the genders.
def vconcat_bar(focus_var: str, title: str, color: str=None, width: int=700, height: int=400):
    # Boxplot of each numeric column against `focus_var`, stacked vertically
    base = alt.Chart(df).mark_boxplot().encode(
        x=alt.X("Age"),
        y=alt.Y(focus_var),
        color=color if color is not None else focus_var
    ).properties(
        width=width,
        height=height
    )
    for col in df.columns:
        if col in ("Age", "Gender", "Profession"):
            continue
        temp_chart = alt.Chart(df).mark_boxplot().encode(
            x=alt.X(f"{col}"),
            y=alt.Y(focus_var),
            color=color if color is not None else focus_var
        ).properties(
            width=width,
            height=height
        )
        base = base & temp_chart
    return base.properties(title=alt.Title(title))
vconcat_bar("Gender", "Checking Class Balance")
Everything looks fairly even; there is not much difference in any of these variables across gender. Since the distributions look nearly identical for males and females, gender does not appear to interact strongly with the other variables. We can also check Profession for anything notable.
So, we can move on to the next step of our sklearn algorithm cheat sheet.
Since we have fewer than 10k samples, we can go straight to the model, starting with KMeans, and then we might also try Spectral Clustering and GMM.
vconcat_bar("Profession", "Checking Class Balance - Profession")
Now, we can move on to preprocessing the data.
Preprocessing¶
Our plan is to standardize the data to make our variables more comparable, and to use One-Hot encoding for our categorical variables using a pipeline. After that, we can move on to our KMeans model.
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
df = df.drop("CustomerID")
numeric_cols = [col for col in df.columns if df[col].dtype in (pl.Float64, pl.Int64)]
categorical_cols = [col for col in df.columns if df[col].dtype in (pl.Categorical, pl.String)]
numeric_cols, categorical_cols
(['Age', 'Annual Income ($)', 'Spending Score (1-100)', 'Work Experience', 'Family Size'], ['Gender', 'Profession'])
numeric_transformer = Pipeline(
    steps=[("standardizer", StandardScaler())]
)
categorical_transformer = Pipeline(
    steps=[("OneHot", OneHotEncoder(handle_unknown="ignore", sparse_output=False))]
)
preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ],
    remainder="passthrough"
)
preprocessor.set_output(transform="polars")
pipe = Pipeline(steps=[("preprocessor", preprocessor)])
pipe
Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('num',
Pipeline(steps=[('standardizer',
StandardScaler())]),
['Age', 'Annual Income ($)',
'Spending Score (1-100)',
'Work Experience',
'Family Size']),
('cat',
Pipeline(steps=[('OneHot',
OneHotEncoder(handle_unknown='ignore',
sparse_output=False))]),
['Gender', 'Profession'])]))])
Now that our preprocessor is ready, we can test our KMeans model as a baseline.
Modeling¶
KMeans clustering partitions a dataset into K distinct clusters. We first specify the desired number of clusters K, and the KMeans algorithm then assigns each observation to exactly one of the K clusters; because each observation belongs to exactly one cluster, the clusters are non-overlapping. To judge how "good" a clustering is, KMeans uses the within-cluster variation, which we want to be as small as possible. So we want to partition the observations into K clusters such that the total within-cluster variation, summed over all K clusters, is as small as possible.
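To make "within-cluster variation" concrete, here is a small sketch (on synthetic data, not this dataset) that computes the quantity by hand and compares it against the `inertia_` attribute that sklearn's KMeans exposes:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Total within-cluster variation: the sum of squared distances of each
# point to its assigned cluster centroid -- the quantity KMeans minimizes.
wcss = sum(
    np.sum((X[km.labels_ == k] - km.cluster_centers_[k]) ** 2)
    for k in range(3)
)
# sklearn exposes the same quantity as `km.inertia_`
```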
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.mixture import GaussianMixture
df_scaled = pipe.fit_transform(df)
df_scaled[:3]
| num__Age | num__Annual Income ($) | num__Spending Score (1-100) | num__Work Experience | num__Family Size | cat__Gender_Female | cat__Gender_Male | cat__Profession_Artist | cat__Profession_Doctor | cat__Profession_Engineer | cat__Profession_Entertainment | cat__Profession_Executive | cat__Profession_Healthcare | cat__Profession_Homemaker | cat__Profession_Lawyer | cat__Profession_Marketing | cat__Profession_No Profession |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| -1.054089 | -2.093501 | -0.428339 | -0.791207 | 0.117497 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| -0.983723 | -1.656133 | 1.075546 | -0.281162 | -0.390051 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| -1.018906 | -0.540845 | -1.609962 | -0.791207 | -1.405148 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
new_num_cols = list(pipe.named_steps["preprocessor"].named_transformers_["num"].get_feature_names_out())
new_cat_cols = list(pipe.named_steps["preprocessor"].named_transformers_["cat"].get_feature_names_out())
all_cols = new_num_cols + new_cat_cols
kmeans = KMeans(n_clusters=2, random_state=42)
kmeans_clusters = kmeans.fit_predict(df_scaled)
This will be a bit hard to visualize since we have many variables, so we can use Principal Component Analysis (PCA) for dimensionality reduction. The first principal component direction is the one along which the observations vary the most.
PCA is a dimensionality reduction and interpretation technique used to explain the variance-covariance structure of a set of variables through linear combinations. Much of the variability of a dataset can often be accounted for by a small number of principal components: the first component captures the maximum variability, and each subsequent component captures the most remaining variance while being orthogonal to the earlier ones, giving a simpler low-dimensional view of the data.
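As a sanity check on what PCA reports, the explained variances are the leading eigenvalues of the sample covariance matrix. A small sketch on randomly generated correlated data (not this dataset):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 500 samples of 4 correlated features (synthetic, for illustration)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))

pca = PCA(n_components=2).fit(X)

# The variances PCA reports are the largest eigenvalues of the sample
# covariance matrix, sorted in decreasing order
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]
```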
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
pca.fit(df_scaled)
scores = pca.transform(df_scaled)
pca_components = pca.components_
pca_explained_variance = pca.explained_variance_
pca_mean = pca.mean_
scores[:, 0]
array([-1.88619617, -1.68839776, -1.6426034 , ..., 0.1581532 ,
1.20200114, -0.06268867], shape=(2000,))
def plot2d(df: pl.DataFrame, clusters, title):
    # NOTE: uses the global `scores` array from the PCA cell above, so `df`
    # must have the same row order as the data that PCA was fit on.
    df_copy = df.clone()
    model_pca_df = df_copy.with_columns([
        pl.Series("PC1", scores[:, 0]),
        pl.Series("PC2", scores[:, 1]),
        pl.Series('Cluster', clusters.astype(str))
    ])
    chart = alt.Chart(model_pca_df).mark_circle(size=200).encode(
        x=alt.X('PC1:Q', title="PC1"),
        y=alt.Y('PC2:Q', title="PC2"),
        color=alt.Color("Cluster:N"),
        tooltip=["PC1", "PC2"]
    ).properties(
        width=700,
        height=400,
        title=title
    ).interactive()
    return chart, model_pca_df
def create_pca_chart(df: pl.DataFrame):
    # Build one line segment per principal component, scaled for visibility
    line_segments = []
    for i, (comp, var_ratio) in enumerate(zip(pca.components_, pca.explained_variance_ratio_)):
        scale_factor = 3.5
        x_end = comp[0] * scale_factor
        y_end = comp[1] * scale_factor
        line_segments.append({
            'component': f'PC{i+1}',
            'x': [0, x_end],
            'y': [0, y_end],
            'variance': f'{var_ratio*100:.1f}%',
            'index': i
        })
    points = alt.Chart(df).mark_circle(size=80, opacity=0.45, color="olivedrab").encode(
        x=alt.X('PC1:Q', scale=alt.Scale(domain=[-4, 4]), title='PC1'),
        y=alt.Y('PC2:Q', scale=alt.Scale(domain=[-4, 4]), title='PC2')
    )
    lines_list = []
    for seg in line_segments:
        seg_df = pl.DataFrame({
            'x': seg['x'],
            'y': seg['y'],
            'component': [seg['component']] * 2,
            'variance': [seg['variance']] * 2
        }, strict=False)
        line_chart = alt.Chart(seg_df).mark_line(size=3).encode(
            x='x:Q',
            y='y:Q',
            color=alt.value(['#e41a1c', '#377eb8'][seg['index']]),
            tooltip=['component:N', 'variance:N']
        )
        lines_list.append(line_chart)
    # Arrowheads at the tip of each component direction
    arrow_points = []
    for seg in line_segments:
        arrow_points.append({
            'x': seg['x'][1],
            'y': seg['y'][1],
            'component': seg['component'],
            'variance': seg['variance']
        })
    arrow_df = pl.DataFrame(arrow_points)
    arrows = alt.Chart(arrow_df).mark_point(
        size=150,
        shape='triangle-up',
        opacity=0.8
    ).encode(
        x='x:Q',
        y='y:Q',
        color=alt.Color('component:N', scale=alt.Scale(domain=['PC1', 'PC2'], range=['#e41a1c', '#377eb8'])),
        tooltip=['component:N', 'variance:N']
    )
    chart = (points + lines_list[0] + lines_list[1] + arrows).properties(
        width=600,
        height=600,
        title='PCA of 2 Components'
    ).interactive()
    return chart
kmeans_chart, kmeans_pca_df = plot2d(df, kmeans_clusters, title="KMeans PCA")
kmeans_chart
create_pca_chart(kmeans_pca_df)
from sklearn.metrics import silhouette_score
sil_score = silhouette_score(df_scaled, kmeans_clusters)
sil_score
0.11703237208125648
Our silhouette score of about 0.117 is very low, and according to the PCA view the data overlaps heavily in the first two principal components. You can even see it in the shape of the plot: no matter how many clusters we choose, everything is compacted into one area.
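For calibration, on clearly separated synthetic clusters the silhouette score comes out much higher, which underlines how weak 0.117 is. A sketch on synthetic blobs, not this dataset:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Tight, well-separated synthetic clusters for comparison
X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.5, random_state=42)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Silhouette approaches 1 for compact, well-separated clusters
good_score = silhouette_score(X, labels)
```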
Spectral Clustering¶
Spectral clustering is an unsupervised learning algorithm that uses connectivity between data points to form the clusters. It uses eigenvalues and eigenvectors of a graph built from the data to cast the data into lower dimensions, somewhat like PCA. The first step in performing spectral clustering by hand is to build a similarity graph. Each vertex in an undirected graph represents a data point. Two vertices are connected by an edge if the similarity between the data points is positive or larger than a certain threshold, with each edge weighted by a non-negative similarity (it can be 0 or more); if a weight is zero, the vertices aren't connected. Then we find a partition where edges between different groups have low weights and edges within a group have high weights. This matches the general clustering definition: points in one group are similar to other points in that same group, and different from points in other groups. The similarity graph is typically built as an epsilon-neighbourhood graph or a k-nearest-neighbor graph, with epsilon and the parameter k fixed beforehand.
In the epsilon-neighborhood graph, each point is connected to all points within the epsilon radius. Since the connected vertices are already roughly equally similar (all pairwise distances are smaller than epsilon), the edge weights are usually not stored, so the graph is unweighted. (In math, epsilon usually means an "arbitrarily small number" or an error term, but here it is simply a fixed parameter.)
The goal of the k-nearest-neighbor graph is to connect a vertex to another vertex if the latter is among the k nearest neighbors of the former. To make the graph undirected, we can ignore the directions of the edges and connect two vertices with an undirected edge if the first vertex is among the k nearest neighbors of the second OR vice versa. The other way is to connect two vertices only if each is among the k nearest neighbors of the other, which results in a mutual k-nearest-neighbor graph. After connecting the vertices, we weight the edges by the similarity of their endpoints (as described above).
The fully connected graph simply connects all data points that have positive similarity with each other and weights all the edges by similarity, where the similarities form a square (affinity) matrix.
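A minimal sketch of the k-nearest-neighbor construction described above, using sklearn's `kneighbors_graph` and the "OR" symmetrization rule (synthetic two-moons data for illustration):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import kneighbors_graph

X, _ = make_moons(n_samples=100, noise=0.05, random_state=0)

# Directed connectivity graph: edge (i, j) if j is among the
# 5 nearest neighbors of i (unweighted, entries are 0 or 1)
A = kneighbors_graph(X, n_neighbors=5, mode="connectivity").toarray()

# Symmetrize with the "OR" rule: connect i and j with an undirected
# edge if either one is among the k nearest neighbors of the other
A_sym = np.maximum(A, A.T)
```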
spec = SpectralClustering(n_clusters=2, random_state=42)
spec_clusters = spec.fit_predict(df_scaled)
spec_chart, spec_pca_df = plot2d(df, spec_clusters, "Spectral Clustering PCA")
spec_chart
create_pca_chart(spec_pca_df)
Gaussian Mixture (GMM)¶
A Gaussian Mixture Model is an ML method that determines the probability of each point belonging to each of several clusters. It calculates the overall likelihood of observing a data point under the mixture by summing over all possible Gaussians (clusters) for each point. To fit a GMM to the data, we use Expectation-Maximization (EM), a popular iterative method for maximum likelihood estimation. It optimizes the parameters by calculating the posterior probability, that is, the revised probability after factoring in new evidence, for each data point in a set of samples. First, it calculates the probability of each data point belonging to each cluster; this is the Expectation step (E-step). Then it updates the mixing weights and re-estimates the Gaussian distribution parameters, mu (the vector of means) and sigma (the covariance matrix), in the Maximization step (M-step). This process repeats until the parameters no longer change significantly.
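The E-step "responsibilities" can be inspected directly via `predict_proba`, which returns each point's posterior probability for each component. A small sketch on synthetic blobs, not this dataset:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=2, random_state=0)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Posterior probability of each component for each point; each row sums
# to 1 because every point is shared (softly) across all Gaussians
resp = gmm.predict_proba(X)
```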
gmm = GaussianMixture(n_components=2, random_state=42)
gmm_clusters = gmm.fit_predict(df_scaled)
gmm_chart, gmm_pca_df = plot2d(df, gmm_clusters, "Gaussian Mixture PCA")
gmm_chart
create_pca_chart(gmm_pca_df)
def elbow_plot(df: pl.DataFrame, max_clusters: int=10, min_clusters: int=1):
    wcss = []
    for i in range(min_clusters, max_clusters+1):
        kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
        kmeans.fit(df)
        wcss.append(kmeans.inertia_)
    df_elbow = pl.DataFrame({
        'Number of clusters': range(min_clusters, max_clusters+1),
        "WCSS": wcss
    })
    chart = alt.Chart(df_elbow).mark_line(point=True).encode(
        x=alt.X("Number of clusters:Q", title="Number of Clusters (K)"),
        y=alt.Y("WCSS:Q", title="Within-Cluster Sum of Squares (Within-Cluster Variation)")
    ).properties(
        width=700,
        height=400,
        title='Elbow Method for Optimal K Clusters'
    )
    return chart.interactive()
elbow_plot(df_scaled)
- There's no clear elbow, so no obviously optimal number of clusters stands out
import plotly.express as px
pca = PCA(n_components=3, random_state=42)
pca.fit(df_scaled)
scores = pca.transform(df_scaled)
pca_components = pca.components_
pca_explained_variance = pca.explained_variance_
pca_mean = pca.mean_
scores[:, 0]
array([-1.88619617, -1.68839776, -1.6426034 , ..., 0.1581532 ,
1.20200114, -0.06268867], shape=(2000,))
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans_clusters2 = kmeans.fit_predict(df_scaled)
kmeans_df = df.clone()
kmeans_pca3d_df = kmeans_df.with_columns([
    pl.Series("PC1", scores[:, 0]),
    pl.Series("PC2", scores[:, 1]),
    pl.Series("PC3", scores[:, 2]),
    pl.Series('Cluster', kmeans_clusters2.astype(str))
])
fig = px.scatter_3d(kmeans_pca3d_df, x="PC1", y="PC2", z="PC3", color="Cluster")
fig.show()
spec = SpectralClustering(n_clusters=3, random_state=42)
spec_clusters2 = spec.fit_predict(df_scaled)
spec_df = df.clone()
spec_pca3d_df = spec_df.with_columns([
    pl.Series("PC1", scores[:, 0]),
    pl.Series("PC2", scores[:, 1]),
    pl.Series("PC3", scores[:, 2]),
    pl.Series('Cluster', spec_clusters2.astype(str))
])
fig = px.scatter_3d(spec_pca3d_df, x="PC1", y="PC2", z="PC3", color="Cluster")
fig.show()
gmm = GaussianMixture(n_components=3, random_state=42)
gmm_clusters2 = gmm.fit_predict(df_scaled)
gmm_df = df.clone()
gmm_pca3d_df = gmm_df.with_columns([
    pl.Series("PC1", scores[:, 0]),
    pl.Series("PC2", scores[:, 1]),
    pl.Series("PC3", scores[:, 2]),
    pl.Series('Cluster', gmm_clusters2.astype(str))
])
fig = px.scatter_3d(gmm_pca3d_df, x="PC1", y="PC2", z="PC3", color="Cluster")
fig.show()
This data does not look suitable for clustering analysis; the problem would probably be better framed as classification or regression, depending on the target variable you choose. None of the three clustering algorithms, viewed through PCA, detects an underlying group structure in the data.
Impact¶
This data isn't suitable for clustering analysis; it's more of a regression/classification problem. Since the data points are grouped together with no separation, it isn't really possible for an unsupervised learning algorithm like clustering to detect a pattern in this data: everything looks like it belongs to a single group. Although the dataset contains more women than men, the summary statistics were essentially the same for both, so the imbalance didn't change anything. If this weren't a clustering project, I would build a classification model to predict which profession someone might be in based on the other variables. The impact is that we ruled out a type of model we can use, and we can now go back to the scikit-learn cheat sheet and see which path to take instead. The impact on myself is that I learned a lot, mostly the math from MIT papers, along with somewhat clearer explanations from GeeksforGeeks.